This file contains the data analysis as requested. I also developed a Dashboard in order to better display and interact with Outvio’s databases.
Regarding the data analysis, the first step is to load the required packages.
# Loading Packages
library('tidyverse') # Package for data manipulation
library('lubridate') # Manipulate date variables
library('plotly') # Package for data visualization
library('DT') # Render some nice HTML tables
After that, it is time to read and import the data. After import databases using R, it was necessary to fix some variables types and create useful features.
# Reading data ----
# Packages - Reading CSV data
packages <- read.csv('data/packages.csv', na.strings=c("NA","NaN", ""))
# Products - Reading CSV data
products <- read.csv('data/products.csv', na.strings=c("NA","NaN", ""))
# Shipments - Read CSV data and fix date formats
shipments <- read.csv('data/shipments.csv', na.strings=c("NA","NaN", "")) %>%
mutate(createdAt = lubridate::date(createdAt),
deliverDate = lubridate::date(deliverDate),
estimatedDeliverDate = lubridate::date(estimatedDeliverDate),
pickupDate = lubridate::date(pickupDate),
processDate = lubridate::date(processDate),
deliveryTime = as.numeric(difftime(deliverDate, createdAt, units = "days")), # Create DeliveryTime as the difference in days between createdAt and deliverDate
predictedDiff = as.numeric(difftime(estimatedDeliverDate, deliverDate, units = "days")), # Time difference between predicted and delivered dates
delayed = ifelse(deliverDate > estimatedDeliverDate, 'Delayed', 'On Time')) # Create a variable that classifies delivered orders into Delayed or On Time
This section presents the resolution of the minimum requirements of this task.
DeuschePost presents the highest deliveryTime, followed by transaher and fedex.
shipments %>%
filter(!is.na(deliverDate)) %>% # Remove not delivered orders
group_by(courier) %>%
summarise(mean = mean(deliveryTime)) %>%
arrange(desc(mean))
“dhl express - gpt - priority (packet tracked)” presents the highest deliveryTime.
shipments %>%
filter(!is.na(deliverDate)) %>% # Remove not delivered orders
group_by(courier) %>%
summarise(mean = mean(deliveryTime)) %>%
arrange(desc(mean))
shipments_aux <- shipments %>%
separate_rows(packages) %>%
filter(!packages %in% c('oid', ':', '')) # Unnest and create one row per package
packages_aux <- packages %>%
separate_rows(products) %>%
filter(!products %in% c('oid', ':', '')) # Unnest and create on row per product
products_per_order <- shipments_aux %>%
inner_join(packages_aux, by = c('packages' = 'X_id')) %>% # Join Shipments and Packages data
group_by(X_id) %>%
summarise(products = n_distinct(products)) %>% # Count # of products whithin each order
ungroup() %>%
summarise(products_per_order = mean(products))
c('In average, each order presents 2.85 products.')
## [1] "In average, each order presents 2.85 products."
In order to better explore the provided data bases, I formulate some questions to be answered through data analysis.
# Top 10 most popular Courier
shipments %>%
group_by(courier) %>%
summarise(count = n_distinct(X_id)) %>%
arrange(desc(count)) %>%
top_n(10) %>%
plot_ly(y = ~count,
x = ~reorder(courier,desc(count)),
type = 'bar')
# Top 10 most popular method
shipments %>%
group_by(method) %>%
summarise(count = n_distinct(X_id)) %>%
arrange(desc(count)) %>%
top_n(10) %>%
plot_ly(y = ~count,
x = ~reorder(method,desc(count)),
type = 'bar')
Comparing the effective deliver time with the predicted, 16% of the orders have been delayed through the analyze time period.
shipments %>%
filter(!is.na(delayed)) %>% # Remove not delivered orders
group_by(delayed) %>%
summarise(count = n_distinct(X_id)) %>%
ungroup() %>%
mutate(perc = count/sum(count)) %>% # Criando a visão percentual
plot_ly(y = ~perc,
x = ~delayed,
type = 'bar') %>%
layout(barmode = 'stack')
The plot is comparing On Time delivered orders versus Delayed orders per courier. Envialia presents a difficult situation, since it is the most popular courier and, regarding the main couriers, is the one that presents more delayed orders.
shipments %>%
filter(!is.na(delayed)) %>% # Remove not delivered orders
group_by(delayed, courier) %>%
summarise(count = n_distinct(X_id),
predictedDiff = mean(predictedDiff)) %>%
ungroup() %>%
group_by(courier) %>%
mutate(perc = count/sum(count)) %>% # Criando a visão percentual
ungroup() %>%
plot_ly(y = ~count,
x = ~reorder(courier, desc(count)),
color = ~delayed,
type = 'bar') %>%
layout(barmode = 'stack')
The plot is comparing On Time delivered orders versus Delayed orders per courier. Envialia presents a difficult situation, since it is the most popular courier and, regarding the main couriers, is the one that presents more delayed orders.
shipments %>%
filter(!is.na(delayed)) %>% # Remove not delivered orders
group_by(delayed, courier) %>%
summarise(count = n_distinct(X_id),
predictedDiff = mean(predictedDiff)) %>%
ungroup() %>%
group_by(courier) %>%
mutate(perc = count/sum(count)) %>% # Criando a visão percentual
ungroup() %>%
plot_ly(y = ~perc,
x = ~reorder(courier, desc(count)),
color = ~delayed,
type = 'bar') %>%
layout(barmode = 'stack')
60% of the delayed orders were delayed by only one day. Almost 90% of the delayed orders were delayed by a maximum of 5 days.
shipments %>%
filter(delayed == 'Delayed') %>% # Remove not delivered orders
group_by(predictedDiff) %>%
summarise(count = n_distinct(X_id)) %>%
ungroup() %>%
mutate(perc = count/sum(count)) %>% # Criando a visão percentual
plot_ly(x = ~predictedDiff,
y = ~perc,
type = 'bar')
One option to estimate which are the top performers courier is to look for the orders volume of delivers versus delivery Time.
shipments %>%
filter(!is.na(deliverDate)) %>% # Remove not delivered orders
group_by(courier) %>%
summarise(volume = n_distinct(X_id),
deliveryTime = mean(deliveryTime)) %>%
plot_ly(x = ~deliveryTime,
y = ~volume,
text = ~courier)
shipments %>%
filter(!is.na(deliverDate) & delayed == 'On Time') %>% # Remove not delivered orders
group_by(courier) %>%
summarise(volume = n_distinct(X_id),
deliveryTime = mean(deliveryTime)) %>%
plot_ly(x = ~deliveryTime,
y = ~volume,
text = ~courier)
shipments_aux <- shipments %>%
separate_rows(packages) %>%
filter(!packages %in% c('oid', ':', '')) # Unnest and create one row per package
packages_aux <- packages %>%
separate_rows(products) %>%
filter(!products %in% c('oid', ':', '')) # Unnest and create on row per product
shipments_aux %>%
inner_join(packages_aux, by = c('packages' = 'X_id')) %>% # Join Shipments and Packages data
group_by(X_id) %>%
summarise(products = n_distinct(products)) %>% # Count # of products whithin each order
ungroup() %>%
summarise(products_per_order = mean(products))